Case Study 35.2: Event-Driven Architecture with Kafka and DB2
Background
Meridian National Bank's fraud detection system operates as a nightly batch process. A COBOL program reads the day's transactions from DB2, applies 23 rule-based checks (unusual amounts, geographic anomalies, velocity patterns), and generates a fraud report that analysts review the next morning.
The problem: by the time analysts see the report, fraudulent transactions are 12-18 hours old. Stolen cards have been used for multiple purchases. Account takeover attacks have drained balances. The bank's fraud losses have increased 34% year over year.
The mandate: detect and respond to potential fraud within 30 seconds of a suspicious transaction.
The Challenge
Build a real-time fraud detection system that:
- Processes every transaction within seconds of it hitting DB2
- Applies the same 23 rules as the batch system (regulatory requirement)
- Adds ML-based anomaly detection for patterns the rules miss
- Automatically freezes accounts when confidence exceeds a threshold
- Notifies customers immediately via push notification and SMS
- Maintains a complete audit trail for regulatory review
- Operates alongside (not replacing) the existing batch system during a 6-month parallel run
Architecture
[DB2 z/OS]
    TRANSACTIONS table (CDC enabled)
    ACCOUNTS table (CDC enabled)
          |
          v  (InfoSphere Data Replication → Kafka)
[Kafka Cluster]
    meridian.cdc.TRANSACTIONS
    meridian.cdc.ACCOUNTS
          |
     +----+----+
     |    |    |
     v    v    v
[Rule    [ML      [Notification
 Engine]  Model]   Service]
     |    |             |
     +--+-+             |
        |               |
        v               v
[Fraud Decision    [Push / SMS
 Service]           Gateway]
        |
        v
[DB2: PROC_FREEZE_ACCOUNT]
[DB2: FRAUD_ALERTS table]
Implementation
Phase 1: CDC Pipeline
Enable CDC on DB2 z/OS:
The DBA team enabled data capture on the TRANSACTIONS and ACCOUNTS tables. On z/OS, this uses DB2's built-in log-based capture mechanism via InfoSphere Data Replication (IIDR).
ALTER TABLE MERIDIAN.TRANSACTIONS DATA CAPTURE CHANGES;
ALTER TABLE MERIDIAN.ACCOUNTS DATA CAPTURE CHANGES;
IIDR reads the DB2 recovery log (active and archive logs) and publishes change events to Kafka. The configuration specifies:
- Capture latency target: 2 seconds
- Buffering: batch events up to 100 or 1 second, whichever comes first
- Error handling: on failure, retry 3 times, then write to a dead-letter queue
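The "100 events or 1 second" buffering rule is a standard count-or-time flush policy. A minimal sketch of that policy logic, not IIDR's actual implementation (class and parameter names here are illustrative):

```python
import time

class FlushPolicy:
    """Flush a buffer when it reaches max_events OR when the oldest
    buffered event is max_age seconds old, whichever comes first."""

    def __init__(self, max_events=100, max_age=1.0, clock=time.monotonic):
        self.max_events = max_events
        self.max_age = max_age
        self.clock = clock          # injectable for testing
        self.buffer = []
        self.first_event_ts = None  # when the current batch started

    def add(self, event) -> bool:
        """Buffer an event; return True if the batch should flush now."""
        if self.first_event_ts is None:
            self.first_event_ts = self.clock()
        self.buffer.append(event)
        return self.should_flush()

    def should_flush(self) -> bool:
        if not self.buffer:
            return False
        if len(self.buffer) >= self.max_events:
            return True
        return (self.clock() - self.first_event_ts) >= self.max_age

    def flush(self) -> list:
        batch, self.buffer, self.first_event_ts = self.buffer, [], None
        return batch
```

The count threshold bounds batch size under load; the age threshold bounds latency when traffic is light, which is how the 2-second capture latency target stays achievable overnight.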
Kafka topic design:
| Topic | Partitions | Key | Retention |
|---|---|---|---|
| meridian.cdc.TRANSACTIONS | 12 | TRANSACTION_ID | 30 days |
| meridian.cdc.ACCOUNTS | 6 | ACCOUNT_ID | 7 days (compacted) |
| meridian.fraud.alerts | 3 | ACCOUNT_ID | 90 days |
| meridian.fraud.decisions | 3 | TRANSACTION_ID | 90 days |
| meridian.notifications.outbound | 6 | CUSTOMER_ID | 1 day |
Keying meridian.cdc.ACCOUNTS by ACCOUNT_ID ensures all changes to a single account arrive in order. Because meridian.cdc.TRANSACTIONS is keyed by TRANSACTION_ID, the rule engine re-keys the transaction stream by ACCOUNT_ID (forcing a repartition) before any windowed aggregation, so all events for one account are processed in order, which is critical for velocity checks.
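Why same-key ordering holds: Kafka's default partitioner hashes the record key and takes it modulo the partition count, so every event with the same key lands on the same partition and is consumed in append order. A dependency-free sketch of the idea (MD5 stands in for Kafka's murmur2 hash; the function name is illustrative):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a record key to a partition. Same key ->
    same partition, which is what preserves per-account ordering.
    (Kafka's default partitioner uses murmur2; MD5 is just a stand-in
    to keep this sketch dependency-free.)"""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by the same account lands on one partition,
# so its velocity window sees events in order.
assert partition_for("ACCT-1234", 6) == partition_for("ACCT-1234", 6)
```

The flip side is that ordering only holds per key: events for different accounts on different partitions may be processed in any relative order, which is acceptable here because every rule is scoped to a single account.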
Phase 2: Rule Engine
The rule engine is a Kafka Streams application that processes transaction events in real time. It maintains state stores for windowed aggregations:
Rule examples:
- Large transaction: Single transaction > $5,000 (or > 3x the account's average transaction).
- Velocity check: More than 5 transactions in 10 minutes from the same account.
- Geographic anomaly: Transaction from a location > 500 miles from the previous transaction within 2 hours.
- Round amount pattern: Three or more transactions for exact round amounts ($100, $200, $500) within 1 hour.
- New payee + large amount: First-ever transfer to a new recipient > $1,000.
Each rule produces a risk score (0-100). Scores are aggregated using a weighted formula.
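As a sketch of how two of these rules and the aggregation might look as pure functions: the thresholds come from the list above, but the individual rule scores and the 0.5/0.5 weights are illustrative, since the actual weighted formula is not given.

```python
from datetime import datetime, timedelta

def large_transaction_score(amount: float, account_avg: float) -> int:
    """Rule: single transaction > $5,000 or > 3x the account average."""
    if amount > 5_000 or amount > 3 * account_avg:
        return 80   # illustrative score
    return 0

def velocity_score(timestamps: list) -> int:
    """Rule: more than 5 transactions in any 10-minute window."""
    window = timedelta(minutes=10)
    ts = sorted(timestamps)
    for i, start in enumerate(ts):
        if len([t for t in ts[i:] if t - start <= window]) > 5:
            return 70   # illustrative score
    return 0

RULE_WEIGHTS = {"large_transaction": 0.5, "velocity": 0.5}  # hypothetical

def aggregate(scores: dict) -> float:
    """Weighted combination of per-rule scores into one 0-100 score."""
    return sum(RULE_WEIGHTS[name] * s for name, s in scores.items())
```

In the real system these functions would read their history (average amount, recent timestamps) from the Kafka Streams state stores described below rather than taking it as arguments.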
State management: The rule engine uses Kafka Streams' RocksDB state stores to maintain per-account windows. Window sizes range from 10 minutes (velocity checks) to 30 days (behavioral baseline). State stores are backed by Kafka changelog topics for fault tolerance.
Phase 3: ML Anomaly Detection
A Python-based ML model runs alongside the rule engine. It consumes the same transaction events but applies a trained anomaly detection model (Isolation Forest) that learns each account's normal behavior patterns.
The model considers:
- Transaction amount relative to historical distribution
- Time of day relative to historical patterns
- Merchant category relative to historical spending
- Transaction frequency relative to historical cadence
The ML model produces an anomaly score (0-1). Scores above 0.85 are flagged as suspicious.
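A minimal sketch of this approach with scikit-learn's IsolationForest, assuming a simplified two-feature view (amount and hour of day) and synthetic training data; the production model's feature set and score calibration would differ:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Simulated "normal" behavior for one account:
# amounts around $60, purchases in the early afternoon.
normal = np.column_stack([
    rng.normal(60, 15, 500),   # amount
    rng.normal(13, 2, 500),    # hour of day
])

model = IsolationForest(random_state=42).fit(normal)

def anomaly_score(amount: float, hour: float) -> float:
    """Map sklearn's score_samples (lower = more abnormal) onto a
    rough 0-1 scale where higher means more anomalous."""
    return float(-model.score_samples([[amount, hour]])[0])

# A large purchase at 3 a.m. should score above a routine one.
routine = anomaly_score(55, 14)
suspicious = anomaly_score(4800, 3)
assert suspicious > routine
```

In production the model would be fit per account (or per account segment) on the 30-day behavioral baseline, and the 0.85 threshold would be tuned against labeled fraud cases.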
Phase 4: Decision Service
The decision service combines rule engine scores and ML scores:
final_score = (rule_score * 0.6) + (ml_score * 100 * 0.4)

if final_score >= 90:
    action = FREEZE_ACCOUNT   # Automatic freeze
elif final_score >= 70:
    action = ALERT_ANALYST    # Human review
elif final_score >= 50:
    action = MONITOR          # Enhanced monitoring
else:
    action = ALLOW            # Normal processing
For FREEZE_ACCOUNT actions, the decision service calls DB2 directly via JDBC:
// Call the stored procedure to freeze the account; check the
// out-parameters so a failed freeze is not silently ignored.
try (CallableStatement cs = conn.prepareCall(
        "CALL MERIDIAN.PROC_FREEZE_ACCOUNT(?, ?, ?, ?, ?)")) {
    cs.setInt(1, accountId);
    cs.setString(2, "Automated fraud detection: score=" + finalScore);
    cs.setString(3, "FRAUD");
    cs.registerOutParameter(4, Types.INTEGER); // status_code
    cs.registerOutParameter(5, Types.VARCHAR); // status_msg
    cs.execute();
    if (cs.getInt(4) != 0) {
        throw new SQLException("Freeze failed: " + cs.getString(5));
    }
}
Phase 5: Notification Service
For alerts requiring customer notification, the notification service:
1. Reads from meridian.notifications.outbound
2. Looks up customer contact preferences in DB2
3. Sends push notifications (Firebase/APNS), SMS (Twilio), and/or email
4. Records delivery status back to DB2
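A sketch of the channel-selection logic in steps 2-3, assuming a hypothetical preference record (the field names are illustrative, not the actual DB2 schema):

```python
from dataclasses import dataclass

@dataclass
class ContactPreferences:
    """What the service would load from DB2 for one customer.
    Field names are illustrative, not the actual schema."""
    push_enabled: bool
    sms_enabled: bool
    email_enabled: bool

def channels_for(prefs: ContactPreferences, severity: str) -> list:
    """Pick delivery channels: an account freeze goes out on every
    enabled channel; lower-severity alerts use the cheapest enabled
    channel (push, then SMS, then email)."""
    if severity == "FREEZE_ACCOUNT":
        enabled = [("push", prefs.push_enabled),
                   ("sms", prefs.sms_enabled),
                   ("email", prefs.email_enabled)]
        return [name for name, on in enabled if on]
    if prefs.push_enabled:
        return ["push"]
    if prefs.sms_enabled:
        return ["sms"]
    return ["email"]
```

Sending a freeze notice on all enabled channels is a deliberate choice here: if the account really is compromised, one of the attacker's first moves is often to change a single contact method.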
Exactly-Once Processing
The fraud detection pipeline must not generate duplicate alerts (which would trigger multiple account freezes and confuse customers). The team implemented exactly-once semantics through:
- Idempotent Kafka producers (producer config: enable.idempotence=true)
- Transactional consumers (read-process-write in a Kafka transaction)
- DB2 deduplication table for the freeze operation:
CREATE TABLE MERIDIAN.PROCESSED_FRAUD_EVENTS (
EVENT_ID VARCHAR(100) NOT NULL PRIMARY KEY,
TRANSACTION_ID BIGINT NOT NULL,
ACCOUNT_ID INTEGER NOT NULL,
PROCESSED_TS TIMESTAMP NOT NULL DEFAULT CURRENT TIMESTAMP,
ACTION_TAKEN VARCHAR(30) NOT NULL
);
-- Before freezing, check:
SELECT COUNT(*) FROM MERIDIAN.PROCESSED_FRAUD_EVENTS
WHERE TRANSACTION_ID = ? AND ACCOUNT_ID = ?;
-- If 0, proceed with freeze and insert:
INSERT INTO MERIDIAN.PROCESSED_FRAUD_EVENTS
VALUES (?, ?, ?, CURRENT TIMESTAMP, 'FREEZE_ACCOUNT');
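One caveat: the SELECT-then-INSERT above leaves a small race window if two consumers process the same event concurrently. A common alternative is to insert first and let the primary key reject duplicates. A sketch of that pattern using stdlib sqlite3 as a stand-in for DB2, where the duplicate insert would surface as SQLSTATE 23505 rather than IntegrityError:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE PROCESSED_FRAUD_EVENTS (
        EVENT_ID     TEXT PRIMARY KEY,
        ACTION_TAKEN TEXT NOT NULL
    )""")

def record_once(event_id: str, action: str) -> bool:
    """Insert-first idempotency: the primary key enforces at-most-once.
    Returns True if this call won (proceed with the freeze), False if
    the event was already processed by another consumer."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO PROCESSED_FRAUD_EVENTS VALUES (?, ?)",
                (event_id, action))
        return True
    except sqlite3.IntegrityError:
        return False

assert record_once("txn-981-freeze", "FREEZE_ACCOUNT") is True
assert record_once("txn-981-freeze", "FREEZE_ACCOUNT") is False
```

The database, not the application, arbitrates the race, which is exactly what you want when two consumer instances replay the same event after a rebalance.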
Parallel Run
During the 6-month parallel run, both the batch system and the real-time system operated simultaneously. The batch system remained the system of record; the real-time system's decisions were logged but not acted upon for the first month, then gradually enabled.
Comparison results:
| Metric | Batch System | Real-Time System |
|---|---|---|
| Detection latency | 12-18 hours | 8 seconds (median) |
| True positive rate | 82% | 91% |
| False positive rate | 12% | 7% |
| Fraud losses (monthly) | $847,000 | $312,000 |
| Analyst review volume | 1,200/day | 340/day (high-confidence cases auto-handled) |
Production Operations
Monitoring
The team monitors:
- Kafka lag: if consumer lag exceeds 1,000 events, an alert fires (fraud detection is falling behind).
- Processing latency: p99 latency from DB2 commit to fraud decision must be < 30 seconds.
- False positive rate: tracked daily; if it exceeds 10%, the ML model is retrained.
- State store size: if RocksDB state stores exceed 80% of allocated disk, old windows are pruned.
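Consumer lag itself is simply the log-end offset minus the group's last committed offset, summed over partitions. A minimal sketch of the computation behind the alert:

```python
def consumer_lag(end_offsets: dict, committed: dict) -> int:
    """Total consumer-group lag: for each partition, the broker's
    log-end offset minus the group's committed offset. Partitions the
    group has never committed on count their full backlog."""
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

# Partition 0 is 50 events behind, partition 1 is 600 behind.
assert consumer_lag({0: 500, 1: 800}, {0: 450, 1: 200}) == 650
```

In practice these offsets come from the Kafka admin API (or a tool like Burrow); the point is that lag is measured in events, so the 1,000-event alert threshold is independent of message size.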
Failure Handling
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Kafka broker failure | Events buffered in IIDR | Kafka cluster has 3 brokers; tolerates 1 failure |
| Rule engine crash | Events accumulate in Kafka | Auto-restart via Kubernetes; processes backlog on recovery |
| ML model failure | No anomaly scores | Rule engine operates independently; ML scores default to 0 |
| DB2 connection failure | Cannot freeze accounts | Circuit breaker with retry; alert operations team |
| CDC pipeline failure | No events published | IIDR stores offset in DB2; resumes from last committed position |
Results After 12 Months
| Metric | Before | After | Improvement |
|---|---|---|---|
| Median detection time | 15 hours | 8 seconds | 6,750x faster |
| Monthly fraud losses | $847,000 | $289,000 | 66% reduction |
| Customer satisfaction (fraud response) | 2.1/5 | 4.4/5 | +110% |
| Analyst productivity | 8 cases/hour | 22 cases/hour | 175% improvement |
| Annual fraud savings | Baseline | $6.7M | — |
Lessons Learned
- Start with CDC, not application-level events. CDC captures every change, including those made by batch programs, DBA scripts, and other applications. Application-level events capture only what the application explicitly publishes.
- Exactly-once matters for financial systems. Duplicate fraud alerts cause real harm: duplicate account freezes, duplicate customer notifications, and duplicate investigation cases. Invest in idempotent processing from day one.
- The ML model needs the rule engine, not vice versa. The rule engine handles known fraud patterns immediately; the ML model catches novel patterns. During the ML model's first month, it had a 30% false positive rate, and the rule engine carried the system while the model was retrained.
- Kafka lag is the most important operational metric. When lag increases, fraud detection slows. The team set up automated scaling: if lag exceeds 500 events for more than 2 minutes, a second consumer instance spins up.
- DB2 is still the system of record. Kafka is the transport; DB2 is the truth. Every fraud decision, account freeze, and investigation outcome is written back to DB2. The FRAUD_ALERTS table is the auditable record that regulators examine.
Discussion Questions
- The ML model's false positive rate dropped from 30% to 7% over six months. How would you design the feedback loop that improves the model over time?
- If the CDC pipeline experiences a 1-hour outage and then recovers, what happens to the fraud detection during and after the outage? How would you handle the backlog?
- A regulator asks you to prove that every transaction was evaluated by the fraud system. How would you design the reconciliation process between DB2 transactions and processed fraud events?