Case Study 1: CNB's Data Warehouse Feed Architecture
Background
Central National Bank's data warehouse serves 200 analysts, powers executive dashboards, drives regulatory reporting, and feeds the machine learning models that underpin CNB's fraud detection and customer analytics programs. The warehouse — a Teradata system with 180 terabytes of historical data — receives nightly feeds from 23 mainframe source systems. The largest and most critical feed is the account master extract: 14.2 million accounts, 3,700 bytes per record, producing a 49.6 GB file every night.
Kwame Asante, the mainframe systems architect, inherited this integration in 2018. At the time, it was a mess. The extract job ran for four hours. The FTP transfer took another three hours. The warehouse load took two hours. Total elapsed time: nine hours — blowing past the six-hour batch window by three hours every night and delaying morning reporting.
Rob Fielding, the integration architect, and Lisa Park, the senior systems programmer, spent six months redesigning the feed. This case study documents what they built, why they built it that way, and what they learned.
The Problem in Detail
Volume and Growth
The account master file grows by approximately 8% per year as CNB acquires new customers and business lines. In 2018, the file was 38 GB. By 2026, it reached 49.6 GB. Extrapolating at 8% per year, it will pass 70 GB around 2031. Any solution must scale gracefully.
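That extrapolation is simple compound growth; a quick sketch in Python (the 8% rate and the 2026 file size are the figures quoted above):

```python
# Project the extract size forward at 8% annual growth from the 2026 figure.
def projected_size_gb(base_gb: float, rate: float, years: int) -> float:
    """Compound growth: file size after `years` years at `rate` per year."""
    return base_gb * (1 + rate) ** years

base_2026 = 49.6  # GB, tonight's file size as quoted above

for year in (2028, 2030, 2032):
    print(f"{year}: {projected_size_gb(base_2026, 0.08, year - 2026):.1f} GB")
# → 2028: 57.9 GB / 2030: 67.5 GB / 2032: 78.7 GB
```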
Data Freshness Requirements
The warehouse must reflect end-of-day balances by 6:00 AM Eastern. Analysts begin running queries at 6:30 AM. Executive dashboards refresh at 7:00 AM. The regulatory reporting team pulls data at 8:00 AM. Missing the 6:00 AM window cascades into delayed reports, missed regulatory deadlines, and angry executives.
Existing Architecture (2018)
Step 1: COBOL extract program reads VSAM ACCT-MASTER
→ Writes sequential file
→ Elapsed: 4 hours (VSAM I/O bound)
Step 2: FTP transfer to warehouse server
→ Plain FTP, no compression, no checkpoint
→ Elapsed: 3 hours (network bound)
Step 3: Teradata load utility
→ Elapsed: 2 hours
→ Total: 9 hours
The extract program was written in 1997 and had never been tuned. It read the VSAM file sequentially but processed each record through 47 copybook-to-flat-file field mappings with individual MOVE statements and extensive date conversion logic. The FTP transfer used plain ASCII mode (requiring character conversion at both ends) with no compression or checkpoint/restart. A failed transfer at 90% meant starting over.
Pain Points
- Batch window overrun. The feed consistently finished at 5:00-6:00 AM, frequently past the 6:00 AM deadline.
- No restart capability. Any failure in the extract, transfer, or load required a complete restart from the beginning.
- No validation. The warehouse loaded whatever arrived. Truncated files, corrupted records, and stale data had all been loaded without detection.
- No monitoring. Nobody knew the feed was late until analysts complained.
- Single point of failure. One FTP server, one network path, one load process.
The Redesigned Architecture
Phase 1: Extract Optimization
Rob's first priority was the four-hour extract. He profiled the COBOL program and found three problems:
I/O configuration. The VSAM ACCT-MASTER file had CI (Control Interval) size of 4,096 bytes — far too small for sequential processing. Records averaged 3,700 bytes, meaning one record per CI with wasted space. Rob worked with the storage team to reorganize the file with 26,624-byte CIs, fitting seven records per CI and dramatically reducing I/O count.
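The win comes from cutting the number of control intervals read; a back-of-the-envelope sketch (the ~10 bytes of CIDF/RDF control information per CI is an approximation, not an exact VSAM layout):

```python
# Compare sequential I/O counts for the old 4 KB CIs vs the new 26 KB CIs.
RECORDS = 14_200_000
RECLEN = 3_700      # bytes per account record
OVERHEAD = 10       # approx. CIDF + RDF control bytes per CI (assumption)

def cis_needed(ci_size: int) -> int:
    """Number of control intervals required to hold all records."""
    per_ci = (ci_size - OVERHEAD) // RECLEN
    return -(-RECORDS // per_ci)   # ceiling division

old = cis_needed(4_096)    # 1 record per CI
new = cis_needed(26_624)   # 7 records per CI
print(f"old: {old:,} CIs, new: {new:,} CIs, ratio: {old / new:.1f}x")
# → old: 14,200,000 CIs, new: 2,028,572 CIs, ratio: 7.0x
```

Roughly a sevenfold drop in sequential I/O requests, before the buffer changes are even counted.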
Buffer allocation. The extract JCL allocated only 5 buffers for the VSAM input. For sequential processing, more buffers enable read-ahead. Rob increased to 50 buffers:
//ACCTMSTR DD DSN=CNB.PROD.ACCT.MASTER,
// DISP=SHR,
// AMP=('BUFND=50,BUFNI=10')
Processing logic. The 47 individual MOVE statements with inline date conversion were replaced with a single block MOVE for the fixed-format portion and a call to XFORMUTIL for date fields. Lisa rewrote the date conversion to use a table-driven approach instead of nested IF statements.
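The table-driven idea is independent of COBOL; a minimal Python sketch, with invented field offsets and formats purely for illustration:

```python
# Table-driven date conversion: one table entry per field (offset, length,
# source format) replaces a nested-IF block per field in the old code.
from datetime import datetime

# Hypothetical field layout — offsets and formats are illustrative only.
DATE_FIELDS = [
    (120, 8, "%Y%m%d"),   # account open date, already ISO-ordered
    (128, 8, "%m%d%Y"),   # last activity date, US-ordered
]

def convert_dates(record: str) -> str:
    """Normalize every configured date field to YYYYMMDD."""
    out = list(record)
    for offset, length, fmt in DATE_FIELDS:
        raw = record[offset:offset + length]
        out[offset:offset + length] = datetime.strptime(raw, fmt).strftime("%Y%m%d")
    return "".join(out)
```

Adding a new date field means adding a table row, not another branch of IF logic.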
Result: extract time dropped from 4 hours to 55 minutes.
Phase 2: Transfer Redesign
Lisa replaced FTP with Connect:Direct, implementing the full GDG handoff pattern:
GDG base with LIMIT(30):
DEFINE GDG -
NAME(CNB.PROD.ACCT.EXTRACT) -
LIMIT(30) -
NOEMPTY -
SCRATCH
Connect:Direct process with compression and checkpoint:
ACCT_DW  PROCESS SNODE=TDATA01
STEP01   COPY FROM (PNODE -
               DSN='CNB.PROD.ACCT.EXTRACT(0)' -
               DISP=SHR -
               SYSOPTS="DATATYPE=TEXT") -
         TO (SNODE -
               FILE=/data/warehouse/staging/acct_extract.dat -
               DISP=RPL -
               SYSOPTS="DATATYPE=TEXT") -
         COMPRESS EXTENDED -
         CKPT=1850M
STEP02   RUN TASK (PGM='/opt/cnb/scripts/validate_extract.sh' -
               PARM='acct_extract.dat 14200000') -
         SNODE
         IF (STEP02 EQ 0) THEN
STEP03      RUN TASK (PGM='/opt/cnb/scripts/trigger_load.sh' -
                  PARM='acct_extract.dat ACCOUNT_STG') -
            SNODE
         ELSE
STEP04      RUN TASK (PGM='/opt/cnb/scripts/alert_failure.sh' -
                  PARM='ACCT_DW VALIDATION_FAILED') -
            SNODE
         EIF
Key design decisions:
- COMPRESS EXTENDED reduced the 49.6 GB file to approximately 14 GB during transfer (72% compression). Fixed-length COBOL records with repetitive data patterns compress exceptionally well.
- Checkpointing roughly every half-million records (about 1.85 GB at 3,700 bytes per record) enables restart from the last checkpoint taken. A failure at record 12 million loses at most 500,000 records of transfer progress rather than the entire file.
- Post-transfer validation (STEP02) runs on the receiving node, checking record count against the expected value embedded in the control record. The load only proceeds if validation passes.
Result: transfer time dropped from 3 hours to 25 minutes.
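The validate_extract.sh script invoked in STEP02 is site code and is not reproduced here; a minimal Python sketch of its record-count check, assuming the text-mode transfer leaves one record per line on the warehouse side:

```python
# Count data records in the staging file and compare to the expected count
# carried in the Connect:Direct PARM (a sketch; the real site script differs).
def validate(path: str, expected: int) -> bool:
    data_records = 0
    with open(path, "r") as f:
        for line in f:
            if line[:1] not in ("H", "T"):   # header/trailer are not data
                data_records += 1
    return data_records == expected
```

In the real script, a nonzero exit code on mismatch is what routes the process into the ELSE branch and fires the failure alert.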
Phase 3: Control Records and Validation
Rob designed a two-record control structure — a header at the beginning and a trailer at the end:
Header Record (Record Type 'H'):
01 WS-HEADER-RECORD.
05 WS-HDR-REC-TYPE PIC X(01) VALUE 'H'.
05 WS-HDR-FILE-ID PIC X(12)
VALUE 'ACCT-EXTRACT'.
05 WS-HDR-BUS-DATE PIC 9(08).
05 WS-HDR-CREATE-TS PIC X(26).
05 WS-HDR-GDG-GEN PIC 9(04).
05 WS-HDR-SOURCE-SYS PIC X(08) VALUE 'MAINFRM1'.
05 WS-HDR-EXTRACT-VER PIC X(04) VALUE '03.2'.
05 FILLER PIC X(3637).
Trailer Record (Record Type 'T'):
01 WS-TRAILER-RECORD.
05 WS-TRL-REC-TYPE PIC X(01) VALUE 'T'.
05 WS-TRL-REC-COUNT PIC 9(09).
05 WS-TRL-HASH-TOTAL PIC S9(15) COMP-3.
05 WS-TRL-MIN-ACCT PIC X(12).
05 WS-TRL-MAX-ACCT PIC X(12).
05 WS-TRL-END-TS PIC X(26).
05 FILLER PIC X(3632).
The warehouse load process validates:
1. Header record type is 'H' and file ID matches the expected value
2. Business date is the expected date (not a stale file)
3. Extract version is compatible with the load process
4. Trailer record count matches the actual data records read
5. Hash total of all balance fields matches the trailer hash
6. Account number range (min/max) is reasonable
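The count and hash-total checks are the mechanical core of the validation; a sketch in Python, with illustrative record fields (the real load works against the copybook layouts above):

```python
# Verify the trailer's record count and balance hash total against the
# data records actually read; field names here are illustrative.
def check_trailer(records: list[dict], trailer: dict) -> list[str]:
    errors = []
    if trailer["rec_count"] != len(records):
        errors.append("record count mismatch")
    # Hash total: signed sum of balances, mirroring the COMP-3 trailer field
    if sum(r["balance"] for r in records) != trailer["hash_total"]:
        errors.append("balance hash total mismatch")
    return errors
```

An empty list means the load may proceed; any entry aborts it and raises an alert.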
Any validation failure aborts the load and triggers an alert. In the first year after implementation, these checks caught three incidents that would previously have loaded bad data into the warehouse.
Phase 4: Incremental CDC Feed
The full nightly extract handles the warehouse's primary data needs, but analysts also wanted intraday visibility into account changes. Rob implemented a CDC-based incremental feed:
Application-level CDC in the COBOL update programs:
Every program that updates the ACCT-MASTER VSAM file now calls the CDCWRITE subprogram, which writes a change record containing the before image, after image, operation type, timestamp, and program ID.
MQ publication of CDC events:
A COBOL program runs continuously (started task) reading CDC records and publishing them to MQ topic CNB/ACCOUNTS/CHANGES. The warehouse team subscribes to this topic and applies changes to a staging table in near-real-time.
Nightly reconciliation:
At 6:00 AM, after the full extract loads, a reconciliation job compares the warehouse's account table (built from accumulated CDC events) against the freshly loaded full extract. Discrepancies are logged and investigated. In the first three months, the reconciliation found an average of 12 discrepancies per night, all traced to timing windows — changes that occurred between the extract snapshot time and the last CDC event processed.
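The reconciliation itself is a set comparison; a sketch, assuming each side has been reduced to a mapping from account number to balance:

```python
# Compare the CDC-built view against the full extract; any account whose
# balance differs, or that exists on only one side, is a discrepancy.
def reconcile(cdc_view: dict, extract_view: dict) -> list[tuple]:
    diffs = []
    for acct in sorted(cdc_view.keys() | extract_view.keys()):
        c, e = cdc_view.get(acct), extract_view.get(acct)
        if c != e:
            diffs.append((acct, c, e))   # (account, CDC value, extract value)
    return diffs
```

A `None` on either side of a discrepancy tuple flags an account present in only one view — typically a create or delete that fell into the timing window.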
Phase 5: Monitoring and Alerting
Lisa built a monitoring framework with five layers:
- Job monitoring. CA 7 (now a Broadcom product) tracks every job in the feed chain. Late starts, long runtimes, and non-zero return codes trigger alerts.
- Transfer monitoring. Connect:Direct statistics are scraped into a monitoring dashboard showing transfer rates, compression ratios, and elapsed times.
- Validation monitoring. Control record validation results are logged and trended. A sudden change in record count (more than 2% deviation from the 30-day average) triggers an investigation.
- Reconciliation monitoring. CDC-vs-full-extract discrepancy counts are trended. Rising discrepancy counts indicate a CDC gap.
- SLA monitoring. A simple check: is the warehouse loaded and available by 6:00 AM? If not, escalate immediately.
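The 2% record-count trend check in the validation-monitoring layer takes only a few lines; a sketch, assuming the counts from the last 30 loads are on hand:

```python
# Flag a load whose record count deviates more than 2% from the
# trailing 30-load average.
def count_alert(history: list[int], tonight: int, threshold: float = 0.02) -> bool:
    baseline = sum(history) / len(history)
    return abs(tonight - baseline) / baseline > threshold
```

The same shape of check works for transfer elapsed times and compression ratios in the other layers.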
Results
| Metric | Before (2018) | After (2019) | Current (2026) |
|---|---|---|---|
| Extract time | 4 hours | 55 minutes | 62 minutes |
| Transfer time | 3 hours | 25 minutes | 28 minutes |
| Load time | 2 hours | 1.5 hours | 1.5 hours |
| Total elapsed | 9 hours | 2 hours 50 min | 3 hours |
| SLA met rate | 72% | 99.4% | 99.8% |
| Undetected data issues | ~3/year | 0 | 0 |
| Recovery time after failure | Full restart (9 hrs) | Checkpoint restart (~30 min) | Checkpoint restart (~30 min) |
The total elapsed time has remained stable despite 30% data growth because Connect:Direct compression scales with data size and the extract program's performance is dominated by I/O count (which grows linearly with records, not record size).
Lessons Learned
1. Tune before you re-architect. The extract optimization (VSAM CI size, buffer allocation, code cleanup) cut the extract from 4 hours to 55 minutes without changing the architecture. Always look for tuning wins first.
2. Compression is free performance. Connect:Direct compression converted a 3-hour FTP transfer into a 25-minute compressed transfer. The CPU cost of compression is negligible compared to the I/O and network savings.
3. Control records save you at 3 AM. Three data quality incidents caught by control record validation in year one. Each one would have corrupted the warehouse and required hours of investigation and data repair.
4. CDC complements batch; it doesn't replace it. The CDC feed provides intraday freshness, but the nightly full extract remains the authoritative load. CDC is prone to gaps and timing issues that the full extract reconciles.
5. Monitor for absence, not just failure. The most dangerous scenario is when a job doesn't run at all. No job means no failure, no alert, no notification. Proactive monitoring that expects activity and alerts on silence catches this class of problem.
Discussion Questions
- CNB's file grows by 8% per year. At what point does the full nightly extract become untenable, and what would replace it? Calculate the year when the extract exceeds the batch window even with current optimizations.
- The CDC-vs-full-extract reconciliation consistently finds ~12 discrepancies per night due to timing windows. Is this acceptable? What design changes could reduce this number to zero?
- Rob chose application-level CDC rather than log-based CDC. What are the risks of this approach, and how would you mitigate the risk of a new program being added that modifies ACCT-MASTER without calling CDCWRITE?
- The architecture has a single Connect:Direct transfer path. How would you add redundancy? Consider both network path redundancy and the scenario where the warehouse server is unavailable.
- Lisa's monitoring framework has five layers. Rank them by importance and justify your ranking. If you could only implement two, which two would you choose?