Case Study 35.2: Event-Driven Architecture with Kafka and DB2

Background

Meridian National Bank's fraud detection system operates as a nightly batch process. A COBOL program reads the day's transactions from DB2, applies 23 rule-based checks (unusual amounts, geographic anomalies, velocity patterns), and generates a fraud report that analysts review the next morning.

The problem: by the time analysts see the report, fraudulent transactions are 12-18 hours old. Stolen cards have been used for multiple purchases. Account takeover attacks have drained balances. The bank's fraud losses have increased 34% year over year.

The mandate: detect and respond to potential fraud within 30 seconds of a suspicious transaction.

The Challenge

Build a real-time fraud detection system that:

  - Processes every transaction within seconds of it hitting DB2
  - Applies the same 23 rules as the batch system (a regulatory requirement)
  - Adds ML-based anomaly detection for patterns the rules miss
  - Automatically freezes accounts when confidence exceeds a threshold
  - Notifies customers immediately via push notification and SMS
  - Maintains a complete audit trail for regulatory review
  - Operates alongside (not replacing) the existing batch system during a 6-month parallel run

Architecture

[DB2 z/OS]
  TRANSACTIONS table (CDC enabled)
  ACCOUNTS table (CDC enabled)
      |
      v  (InfoSphere Data Replication → Kafka)
[Kafka Cluster]
  meridian.cdc.TRANSACTIONS
  meridian.cdc.ACCOUNTS
      |
      +-----------+-----------+
      |           |           |
      v           v           v
[Rule Engine] [ML Model] [Notification
      |           |        Service]
      +-----+-----+           |
            |                 |
            v                 v
  [Fraud Decision      [Push / SMS
   Service]             Gateway]
            |
            v
[DB2: PROC_FREEZE_ACCOUNT]
[DB2: FRAUD_ALERTS table]

Implementation

Phase 1: CDC Pipeline

Enable CDC on DB2 z/OS:

The DBA team enabled data capture on the TRANSACTIONS and ACCOUNTS tables. On z/OS, this uses DB2's built-in log-based capture mechanism via InfoSphere Data Replication (IIDR).

ALTER TABLE MERIDIAN.TRANSACTIONS DATA CAPTURE CHANGES;
ALTER TABLE MERIDIAN.ACCOUNTS DATA CAPTURE CHANGES;

IIDR reads the DB2 recovery log (active and archive logs) and publishes change events to Kafka. The configuration specifies:

  - Capture latency target: 2 seconds
  - Buffering: batch events up to 100 or 1 second, whichever comes first
  - Error handling: on failure, retry 3 times, then write to a dead-letter queue
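The size-or-age buffering policy is a standard micro-batching pattern. A dependency-free sketch (class and parameter names are hypothetical, not IIDR configuration):

```python
import time
from typing import Callable, List, Optional

class MicroBatcher:
    """Buffers events and flushes when the batch reaches max_events
    or the oldest buffered event reaches max_age_s, whichever first.
    Illustrative stand-in for IIDR's buffering policy."""

    def __init__(self, flush: Callable[[List[dict]], None],
                 max_events: int = 100, max_age_s: float = 1.0):
        self.flush_fn = flush
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.buffer: List[dict] = []
        self.oldest: float = 0.0

    def add(self, event: dict, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.oldest = now          # timestamp of the oldest buffered event
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_events
                or now - self.oldest >= self.max_age_s):
            self.flush_fn(self.buffer)  # hand the batch downstream
            self.buffer = []
```

A real capture agent would also flush on shutdown and handle flush failures; this shows only the two-trigger policy.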

Kafka topic design:

Topic                            Partitions  Key             Retention
meridian.cdc.TRANSACTIONS        12          ACCOUNT_ID      30 days
meridian.cdc.ACCOUNTS            6           ACCOUNT_ID      7 days (compacted)
meridian.fraud.alerts            3           ACCOUNT_ID      90 days
meridian.fraud.decisions         3           TRANSACTION_ID  90 days
meridian.notifications.outbound  6           CUSTOMER_ID     1 day

Keying the CDC topics by ACCOUNT_ID ensures all events for a single account land in the same partition and are therefore processed in order — critical for velocity checks. (Keying by TRANSACTION_ID would scatter an account's events across partitions and break that guarantee.)
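The ordering guarantee rests on Kafka's key-based partitioning: records with equal keys hash to the same partition, and each partition is consumed in order. A minimal sketch (the hash here is illustrative, not Kafka's actual murmur2):

```python
def partition_for(key: bytes, num_partitions: int) -> int:
    """Stable key -> partition mapping. The real Kafka producer uses
    murmur2; any deterministic hash illustrates the ordering property."""
    h = 0
    for b in key:
        h = (h * 31 + b) & 0x7FFFFFFF
    return h % num_partitions

# Every event keyed by the same account id lands in the same partition,
# so one consumer sees that account's events in commit order.
events = [("ACCT-1001", "txn-1"), ("ACCT-2002", "txn-2"), ("ACCT-1001", "txn-3")]
parts = [partition_for(acct.encode(), 12) for acct, _ in events]
```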

Phase 2: Rule Engine

The rule engine is a Kafka Streams application that processes transaction events in real time, maintaining state stores for windowed aggregations.

Rule examples:

  1. Large transaction: Single transaction > $5,000 (or > 3x the account's average transaction).
  2. Velocity check: More than 5 transactions in 10 minutes from the same account.
  3. Geographic anomaly: Transaction from a location > 500 miles from the previous transaction within 2 hours.
  4. Round amount pattern: Three or more transactions for exact round amounts ($100, $200, $500) within 1 hour.
  5. New payee + large amount: First-ever transfer to a new recipient > $1,000.

Each rule produces a risk score (0-100). Scores are aggregated using a weighted formula.

State management: The rule engine uses Kafka Streams' RocksDB state stores to maintain per-account windows. Window sizes range from 10 minutes (velocity checks) to 30 days (behavioral baseline). State stores are backed by Kafka changelog topics for fault tolerance.
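Rule 2 (the velocity check) shows the windowing pattern. A sketch with an in-memory per-account deque standing in for the RocksDB-backed state store:

```python
from collections import defaultdict, deque

WINDOW_S = 600   # 10-minute window (rule 2)
MAX_TXNS = 5     # threshold: more than 5 transactions in the window

class VelocityRule:
    """Flags accounts exceeding MAX_TXNS transactions within WINDOW_S.
    In production this state lives in a Kafka Streams RocksDB store
    backed by a changelog topic; the deque is the in-memory equivalent."""

    def __init__(self):
        self.windows = defaultdict(deque)   # account_id -> timestamps

    def score(self, account_id: str, ts: float) -> int:
        w = self.windows[account_id]
        w.append(ts)
        while w and ts - w[0] > WINDOW_S:
            w.popleft()                     # evict events outside the window
        # risk score 0-100: maximal once the threshold is crossed
        return 100 if len(w) > MAX_TXNS else 0
```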

Phase 3: ML Anomaly Detection

A Python-based ML model runs alongside the rule engine. It consumes the same transaction events but applies a trained anomaly detection model (Isolation Forest) that learns each account's normal behavior patterns.

The model considers:

  - Transaction amount relative to historical distribution
  - Time of day relative to historical patterns
  - Merchant category relative to historical spending
  - Transaction frequency relative to historical cadence

The ML model produces an anomaly score (0-1). Scores above 0.85 are flagged as suspicious.
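The production model is an Isolation Forest over several features. As a dependency-free stand-in that illustrates the same scoring contract (a 0-1 score, flagged above 0.85), here is a single-feature z-score variant; it is not the actual model:

```python
import math

def anomaly_score(amount: float, history: list) -> float:
    """Maps how far `amount` sits from the account's historical mean
    into [0, 1). Illustrates the scoring contract only; the production
    model is an Isolation Forest over amount, time of day, merchant
    category, and frequency."""
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    std = math.sqrt(var) or 1.0        # guard against zero variance
    z = abs(amount - mean) / std
    return 1.0 - math.exp(-z / 2.0)    # squashes z >= 0 into [0, 1)
```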

Phase 4: Decision Service

The decision service combines rule engine scores and ML scores:

final_score = (rule_score * 0.6) + (ml_score * 100 * 0.4)

if final_score >= 90:
    action = FREEZE_ACCOUNT      # Automatic freeze
elif final_score >= 70:
    action = ALERT_ANALYST       # Human review
elif final_score >= 50:
    action = MONITOR             # Enhanced monitoring
else:
    action = ALLOW               # Normal processing

For FREEZE_ACCOUNT actions, the decision service calls DB2 directly via JDBC:

// Call the stored procedure to freeze the account
try (CallableStatement cs = conn.prepareCall(
        "CALL MERIDIAN.PROC_FREEZE_ACCOUNT(?, ?, ?, ?, ?)")) {
    cs.setInt(1, accountId);
    cs.setString(2, "Automated fraud detection: score=" + finalScore);
    cs.setString(3, "FRAUD");
    cs.registerOutParameter(4, Types.INTEGER);  // status_code
    cs.registerOutParameter(5, Types.VARCHAR);  // status_msg
    cs.execute();
    int statusCode = cs.getInt(4);       // non-zero means the freeze failed
    String statusMsg = cs.getString(5);  // reason, for logging and alerting
}

Phase 5: Notification Service

For alerts requiring customer notification, the notification service:

  1. Reads from meridian.notifications.outbound
  2. Looks up customer contact preferences in DB2
  3. Sends push notifications (Firebase/APNS), SMS (Twilio), and/or email
  4. Records delivery status back to DB2
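Steps 2 and 3, mapping stored preferences to delivery channels, might look like the following sketch; the preference shape and the SMS fallback are assumptions for illustration, not the Meridian schema:

```python
# Hypothetical preference store (in production: a DB2 lookup, step 2).
PREFERENCES = {
    "CUST-9": {"push": True, "sms": True, "email": False},
}

def channels_for(customer_id: str) -> list:
    """Return the channels a customer opted into. Falls back to SMS
    when nothing is enabled, so a fraud alert is never silently dropped
    (the fallback policy is an assumption, not from the case study)."""
    prefs = PREFERENCES.get(customer_id, {})
    enabled = [ch for ch, on in prefs.items() if on]
    return enabled or ["sms"]
```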

Exactly-Once Processing

The fraud detection pipeline must not generate duplicate alerts (which would trigger multiple account freezes and confuse customers). The team implemented exactly-once semantics through:

  1. Idempotent Kafka producers (producer config: enable.idempotence=true)
  2. Transactional consumers (read-process-write in a Kafka transaction)
  3. DB2 deduplication table for the freeze operation:
CREATE TABLE MERIDIAN.PROCESSED_FRAUD_EVENTS (
    EVENT_ID         VARCHAR(100) NOT NULL PRIMARY KEY,
    TRANSACTION_ID   BIGINT NOT NULL,
    ACCOUNT_ID       INTEGER NOT NULL,
    PROCESSED_TS     TIMESTAMP NOT NULL DEFAULT CURRENT TIMESTAMP,
    ACTION_TAKEN     VARCHAR(30) NOT NULL
);

-- Idempotency guard. Rather than SELECT-then-INSERT (a race under
-- concurrent consumers), insert first and let the EVENT_ID primary key
-- reject duplicates atomically:
INSERT INTO MERIDIAN.PROCESSED_FRAUD_EVENTS
    (EVENT_ID, TRANSACTION_ID, ACCOUNT_ID, PROCESSED_TS, ACTION_TAKEN)
VALUES (?, ?, ?, CURRENT TIMESTAMP, 'FREEZE_ACCOUNT');

-- If the insert succeeds, proceed with the freeze. If it fails with
-- SQLSTATE 23505 (duplicate key), the event was already processed: skip.
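The same record-first guard, sketched in application code with an in-memory set standing in for the DB2 table (a hypothetical helper, not the production service; the table is what lets the guard survive restarts and be shared across consumer instances):

```python
processed = set()   # in-memory stand-in for MERIDIAN.PROCESSED_FRAUD_EVENTS

def freeze_once(event_id: str, freeze_fn) -> bool:
    """Apply the freeze at most once per EVENT_ID; duplicate deliveries
    are skipped. Returns True if the freeze ran, False on a duplicate."""
    if event_id in processed:
        return False                 # duplicate -- already handled
    processed.add(event_id)          # record first, then act
    freeze_fn()
    return True
```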

Parallel Run

During the 6-month parallel run, both the batch system and the real-time system operated simultaneously. The batch system remained the system of record; the real-time system's decisions were logged but not acted upon for the first month, then gradually enabled.

Comparison results:

Metric                  Batch System  Real-Time System
Detection latency       12-18 hours   8 seconds (median)
True positive rate      82%           91%
False positive rate     12%           7%
Fraud losses (monthly)  $847,000      $312,000
Analyst review volume   1,200/day     340/day (high-confidence cases auto-handled)

Production Operations

Monitoring

The team monitors:

  - Kafka lag: If consumer lag exceeds 1,000 events, an alert fires (fraud detection is falling behind).
  - Processing latency: p99 latency from DB2 commit to fraud decision must be < 30 seconds.
  - False positive rate: Tracked daily. If it exceeds 10%, the ML model is retrained.
  - State store size: If RocksDB state stores exceed 80% of allocated disk, old windows are pruned.

Failure Handling

Failure Scenario        Impact                      Mitigation
Kafka broker failure    Events buffered in IIDR     Kafka cluster has 3 brokers; tolerates 1 failure
Rule engine crash       Events accumulate in Kafka  Auto-restart via Kubernetes; processes backlog on recovery
ML model failure        No anomaly scores           Rule engine operates independently; ML scores default to 0
DB2 connection failure  Cannot freeze accounts      Circuit breaker with retry; alert operations team
CDC pipeline failure    No events published         IIDR stores offset in DB2; resumes from last committed position

Results After 12 Months

Metric                                  Before    After        Improvement
Median detection time                   15 hours  8 seconds    6,750x faster
Monthly fraud losses                    $847,000  $289,000     66% reduction
Customer satisfaction (fraud response)  2.1/5     4.4/5        +110%
Analyst productivity                    8 cases/hr  22 cases/hr  175% improvement
Annual fraud savings                    Baseline  $6.7M

Lessons Learned

  1. Start with CDC, not application-level events. CDC captures every change, including those made by batch programs, DBA scripts, and other applications. Application-level events only capture what the application explicitly publishes.

  2. Exactly-once matters for financial systems. Duplicate fraud alerts cause real harm: duplicate account freezes, duplicate customer notifications, and duplicate investigation cases. Invest in idempotent processing from day one.

  3. The ML model needs the rule engine, not vice versa. The rule engine handles known fraud patterns immediately. The ML model catches novel patterns. During the ML model's first month, it had a 30% false positive rate. The rule engine carried the system while the model was retrained.

  4. Kafka lag is the most important operational metric. When lag increases, fraud detection slows. The team set up automated scaling: if lag exceeds 500 events for more than 2 minutes, a second consumer instance spins up.
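That scale-out trigger (lag above 500 sustained for 2 minutes) can be sketched as a small hysteresis check; the sampling cadence and the reset-on-recovery behavior are assumptions, not from the Meridian runbook:

```python
from typing import Optional

class LagScaler:
    """Returns True when consumer lag has stayed above `threshold`
    for at least `sustain_s` seconds -- the signal to spin up another
    consumer instance. Resets the clock whenever lag recovers."""

    def __init__(self, threshold: int = 500, sustain_s: int = 120):
        self.threshold = threshold
        self.sustain_s = sustain_s
        self.breach_start: Optional[float] = None

    def observe(self, lag: int, now: float) -> bool:
        if lag <= self.threshold:
            self.breach_start = None       # lag recovered; reset the clock
            return False
        if self.breach_start is None:
            self.breach_start = now        # start timing this breach
        return now - self.breach_start >= self.sustain_s
```

The sustain window prevents a momentary lag spike (e.g. a rebalance) from triggering unnecessary scaling.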

  5. DB2 is still the system of record. Kafka is the transport; DB2 is the truth. Every fraud decision, account freeze, and investigation outcome is written back to DB2. The FRAUD_ALERTS table is the auditable record that regulators examine.

Discussion Questions

  1. The ML model's false positive rate dropped from 30% to 7% over six months. How would you design the feedback loop that improves the model over time?
  2. If the CDC pipeline experiences a 1-hour outage and then recovers, what happens to the fraud detection during and after the outage? How would you handle the backlog?
  3. A regulator asks you to prove that every transaction was evaluated by the fraud system. How would you design the reconciliation process between DB2 transactions and processed fraud events?